0.0.1 Session Aim

  • What and whys of R (15 min)
  • Where and how to use R/RStudio (15 min)
  • Setting up and ready to dive in (15 min)
  • Starting with Data in R (15 min)
  • Manipulating, analysing and exporting data (30 min)
  • Hands on and doubts (30 min)

1 What is R?

  • R is a statistical programming language that is well suited to high-level data analysis.
  • In recent years it has grown to offer much more than statistics. As a simple example, this website is built entirely in R.

2 Why learn R?

We can do a lot in R: statistical analysis, plotting graphs and figures, writing technical documentation, building websites and more. Let's explore.

  • R does not involve lots of pointing and clicking, and that’s a good thing.
  • R code is great for reproducibility
  • R is interdisciplinary and extensible
  • R works on data of all shapes and sizes
  • R produces high-quality graphics
  • R has a large and welcoming community
  • Not only is R free, but it is also open-source and cross-platform

2.3 R can facilitate Reproducible Research

According to recent editorials, the reproducibility crisis is still ongoing:

Reality check on reproducibility

1,500 scientists lift the lid on reproducibility

Nature, May 2016

3 Setting up

Where and how to run R?

R can be run from the command line or through a graphical user interface (GUI). In this session, we will use the RStudio GUI.

3.1 R Homepage

http://www.r-project.org/

3.2 RStudio

RStudio Cloud - https://rstudio.cloud/

Let's explore the RStudio interface.

4 Getting started with Data in R

4.1 Data Type in R

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.

Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object; named objects are also commonly called variables.

Formal data types -

  • character: "a", "swc"
  • numeric: 2, 15.5
  • logical: TRUE, FALSE
  • integer: 2L (the L tells R to store this as an integer)
  • complex: 1+4i (complex numbers with real and imaginary parts)
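The types listed above can be checked with typeof(). The following is a minimal sketch; the object names `values` and `types` are made up for illustration. Note that typeof() reports plain numbers such as 15.5 as "double", which is how R stores numeric values internally.

```r
# One value of each basic type listed above
values <- list("swc", 15.5, TRUE, 2L, 1+4i)

# typeof() reports how R stores each value
types <- sapply(values, typeof)
print(types)
# "character" "double" "logical" "integer" "complex"

# Explicit conversion with the as.*() family of functions
as.numeric("15.5")    # character -> numeric
as.integer(15.9)      # numeric -> integer (the fractional part is dropped)
as.character(TRUE)    # logical -> character
```

Explicit conversions like these are usually easier to debug than conversions R performs silently.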

4.2 Data Structure in R

  • vectors
  • matrix
  • data-frame
  • list
  • factors

4.2.1 Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R.

x <- c(1, 2, 3) #numeric

Using TRUE and FALSE will create a vector of mode logical:

y <- c(TRUE, TRUE, FALSE, FALSE)

While using quoted text will create a vector of mode character:

z <- c("Sarah", "Tracy", "Jon")
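Because all elements of a vector must share one type, c() silently converts mixed inputs to the most general type. A short sketch of this behaviour (the object names are illustrative only):

```r
# numeric + character: everything becomes character
mixed_num_chr <- c(1, "a")
typeof(mixed_num_chr)    # "character"

# logical + numeric: TRUE becomes 1, FALSE becomes 0
mixed_lgl_num <- c(TRUE, 2)
typeof(mixed_lgl_num)    # "double"
mixed_lgl_num            # 1 2
```

This implicit coercion is a frequent cause of surprises, for example when a single stray text value turns a whole numeric column into character.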

4.2.1.1 Examining Vectors

The functions typeof() and length() provide useful information about your vectors and about R objects in general.

typeof(z)
## [1] "character"
length(z)
## [1] 3

4.2.1.2 Adding Elements

The function c() (for combine) can also be used to add elements to a vector.

z <- c(z, "Annette")
z
## [1] "Sarah"   "Tracy"   "Jon"     "Annette"
z <- c("Greg", z)
z
## [1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

4.2.1.3 Vectors from a Sequence of Numbers

You can create vectors as a sequence of numbers.

series <- 1:10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(from = 1, to = 10, by = 0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
## [16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
## [31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
## [46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
## [61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
## [76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
## [91] 10.0
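Besides a step size (by), seq() can also be given the desired number of values. A small sketch, with illustrative object names:

```r
# Step size of 0.1 from 1 to 10, as in the output above
tenths <- seq(from = 1, to = 10, by = 0.1)
length(tenths)    # 91

# Ask for a fixed number of evenly spaced values instead
four_points <- seq(from = 1, to = 10, length.out = 4)
four_points       # 1 4 7 10

# seq_len(n) is equivalent to 1:n for positive n
seq_len(5)        # 1 2 3 4 5
```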

4.2.1.4 Missing Data

R supports missing data in vectors. They are represented as NA (Not Available) and can be used for all the vector types covered in this lesson:

x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)

The function is.na() indicates the elements of the vectors that represent missing data, and the function anyNA() returns TRUE if the vector contains any missing values:

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE
is.na(y)
## [1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
## [1] TRUE
anyNA(y)
## [1] FALSE
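Missing values propagate through most calculations, so it helps to count them and decide how to handle them. The sketch below uses a made-up vector called `weights`:

```r
weights <- c(0.5, NA, 0.7)

# Count the missing values
sum(is.na(weights))            # 1

# A summary of a vector containing NA is itself NA ...
mean(weights)                  # NA

# ... unless missing values are removed first
mean(weights, na.rm = TRUE)    # 0.6

# Keep only the non-missing elements
weights[!is.na(weights)]       # 0.5 0.7
```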

4.2.2 Matrix

In R, a matrix is an extension of a numeric or character vector: it is a vector with dimensions, i.e. rows and columns. As with atomic vectors, the elements of a matrix must all be of the same data type.

m <- matrix(nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
dim(m)
## [1] 2 2

Creating a matrix with content:

m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Note: by default, matrices in R are filled column-wise.

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1, 2, 3, 11, 12, 13),
               nrow = 2,
               ncol = 3,
               byrow = TRUE)
mdat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   11   12   13
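Matrix elements are addressed as [row, column]. The following sketch reuses the 2 x 3 matrix from above; `chr_m` is an extra, made-up example showing that mixed input is coerced to a single type:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

m[2, 3]      # element in row 2, column 3: 6
m[, 2]       # the whole second column: 3 4
dim(t(m))    # the transpose has 3 rows and 2 columns: 3 2

# Mixing numbers and text gives a character matrix
chr_m <- matrix(c(1, "a", 3, 4), nrow = 2)
typeof(chr_m)    # "character"
```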

4.2.3 Data Frame

A data frame is a very important data structure in R. It is the de facto data structure for most tabular data and what we use for statistics.

Some additional information on data frames:

  • Usually created by read.csv() and read.table(), i.e. when importing the data into R.
  • A new data frame can also be created with the data.frame() function.
  • Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.

4.2.3.1 Creating Data Frames by Hand

To create data frames by hand:

dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
##    id  x  y
## 1   a  1 11
## 2   b  2 12
## 3   c  3 13
## 4   d  4 14
## 5   e  5 15
## 6   f  6 16
## 7   g  7 17
## 8   h  8 18
## 9   i  9 19
## 10  j 10 20

Useful Data Frame Functions

  • head() - shows first 6 rows
  • tail() - shows last 6 rows
  • dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
  • nrow() - number of rows
  • ncol() - number of columns
  • str() - structure of data frame - name, type and preview of data in each column
  • names() or colnames() - both show the names attribute for a data frame
  • sapply(dataframe, class) - shows the class of each column in the data frame

A data frame is actually a special kind of list in which every element (column) has the same length.
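The helper functions listed above can be tried on the small data frame created by hand earlier. This is only a sketch; the last two calls check that a data frame behaves like a list with one element per column.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

head(dat, n = 3)      # first 3 rows instead of the default 6
dim(dat)              # 10 3
names(dat)            # "id" "x" "y"
str(dat)              # compact overview of every column
sapply(dat, class)    # class of each column (the class of id depends on the R version)

is.list(dat)          # TRUE
length(dat)           # 3, i.e. the number of columns
```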

Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix).

dat[1, 3]
## [1] 11

As data frames are also lists, it is possible to refer to columns (which are the elements of such a list) using list notation, i.e. either double square brackets or a $.

dat[["y"]]
##  [1] 11 12 13 14 15 16 17 18 19 20
dat$y
##  [1] 11 12 13 14 15 16 17 18 19 20

The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to diversity of data types they can contain.

Dimensions   Homogenous      Heterogeneous
1-D          atomic vector   list
2-D          matrix          data frame

5 Steps to Basic Data Analysis

  • In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
  1. Reading in data
    • read.table()
    • read.csv(), read.delim()
  2. Analysis
    • Manipulating & reshaping the data
      • perhaps dealing with “missing data”
    • Any maths you like
    • Plots
  3. Writing out results
    • write.table()
    • write.csv()
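The three steps above can be sketched end to end on a tiny made-up dataset. Everything here (the file contents, the column names `animal` and `weight`, and the derived `weight_kg` column) is hypothetical and only serves to show the shape of the pipeline.

```r
# 0. Create a tiny CSV in a temporary location so the example is self-contained
input_file <- tempfile(fileext = ".csv")
writeLines(c("animal,weight",
             "a,10",
             "b,",
             "c,30"), input_file)

# 1. Reading in data (the blank weight field is read as NA)
animals <- read.csv(input_file)

# 2. Analysis: drop rows with a missing weight, then derive a new column
complete <- animals[!is.na(animals$weight), ]
complete$weight_kg <- complete$weight / 1000
mean_weight <- mean(complete$weight)    # 20

# 3. Writing out results
output_file <- tempfile(fileext = ".csv")
write.csv(complete, output_file, row.names = FALSE)
```

Writing results to a new file, rather than overwriting the input, keeps the raw data untouched and the analysis repeatable.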

5.1 A simple walk-through of data

Let's consider an example. We are investigating the diversity and weights of the animal species found within plots at our study site. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:

Column            Description
record_id         Unique id for the observation
month             month of observation
day               day of observation
year              year of observation
plot_id           ID of a particular plot
species_id        2-letter code
sex               sex of animal (“M”, “F”)
hindfoot_length   length of the hindfoot in mm
weight            weight of the animal in grams
genus             genus of animal
species           species of animal
taxa              e.g. Rodent, Reptile, Bird, Rabbit
plot_type         type of plot

Download the dataset into the working directory:

download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "portal_data_joined.csv")

5.2 Locate the data

Before we even start the analysis, we need to be sure where the data are located on our hard drive.

  • Functions that import data need a file location as a character vector
  • The default location is the working directory
getwd()
  • If the file you want to read is in your working directory, you can just use the file name
list.files()
  • The file.exists function does exactly what it says on the tin!
    • a good sanity check for your code
file.exists("portal_data_joined.csv")
## [1] TRUE
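A common pattern is to build the path explicitly and only fetch or create the file when it is not already there. The sketch below uses a temporary folder and a hypothetical file name (`example_data.csv`); in this lesson the guarded step would be the download.file() call shown earlier.

```r
# A temporary folder stands in for the project's data folder
data_dir <- tempdir()

# file.path() joins folder and file name with the right separator
data_file <- file.path(data_dir, "example_data.csv")

# Only create (or download) the file if it is missing
if (!file.exists(data_file)) {
  writeLines("id,value", data_file)
}

file.exists(data_file)    # TRUE
basename(data_file)       # "example_data.csv"
```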

5.3 Load the data into an R data frame

surveys <- read.table("portal_data_joined.csv", header = TRUE, sep = ",")
head(surveys)
##   record_id month day year plot_id species_id sex hindfoot_length weight
## 1         1     7  16 1977       2         NL   M              32     NA
## 2        72     8  19 1977       2         NL   M              31     NA
## 3       224     9  13 1977       2         NL                  NA     NA
## 4       266    10  16 1977       2         NL                  NA     NA
## 5       349    11  12 1977       2         NL                  NA     NA
## 6       363    11  12 1977       2         NL                  NA     NA
##     genus  species   taxa plot_type
## 1 Neotoma albigula Rodent   Control
## 2 Neotoma albigula Rodent   Control
## 3 Neotoma albigula Rodent   Control
## 4 Neotoma albigula Rodent   Control
## 5 Neotoma albigula Rodent   Control
## 6 Neotoma albigula Rodent   Control
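In the output above, the sex column shows blanks rather than NA for some rows: by default only the string "NA" is treated as missing in text columns. The sketch below, on a made-up three-row file, shows how the na.strings argument of read.table() can change that. The column names mimic the survey data, but the values are invented.

```r
csv_file <- tempfile(fileext = ".csv")
writeLines(c("record_id,sex,weight",
             "1,M,",
             "2,,40",
             "3,F,48"), csv_file)

# Default: the blank in the text column stays an empty string
default_read <- read.table(csv_file, header = TRUE, sep = ",",
                           stringsAsFactors = FALSE)
default_read$sex       # "M" "" "F"
default_read$weight    # NA 40 48 (blank numeric fields are already NA)

# Treat empty strings as missing as well
na_read <- read.table(csv_file, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE, na.strings = c("NA", ""))
is.na(na_read$sex)     # FALSE TRUE FALSE
```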

Get to know a function through its help page:

?read.table # or press F1 in RStudio with the cursor on the function name

5.4 How to quickly explore the data

Check the dimensions:

ncol(surveys)
## [1] 13
nrow(surveys)
## [1] 34786
dim(surveys)
## [1] 34786    13

The column names are taken from the header row of the file:

colnames(surveys)
##  [1] "record_id"       "month"           "day"             "year"           
##  [5] "plot_id"         "species_id"      "sex"             "hindfoot_length"
##  [9] "weight"          "genus"           "species"         "taxa"           
## [13] "plot_type"

Use the output of str(surveys) to answer the following questions:

  1. What is the class of the object surveys?
  2. How many rows and how many columns are in this object?
  3. How many species have been recorded during these surveys?

str(surveys)
## 'data.frame':    34786 obs. of  13 variables:
##  $ record_id      : int  1 72 224 266 349 363 435 506 588 661 ...
##  $ month          : int  7 8 9 10 11 11 12 1 2 3 ...
##  $ day            : int  16 19 13 16 12 12 10 8 18 11 ...
##  $ year           : int  1977 1977 1977 1977 1977 1977 1977 1978 1978 1978 ...
##  $ plot_id        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ species_id     : chr  "NL" "NL" "NL" "NL" ...
##  $ sex            : chr  "M" "M" "" "" ...
##  $ hindfoot_length: int  32 31 NA NA NA NA NA NA NA NA ...
##  $ weight         : int  NA NA NA NA NA NA NA NA 218 NA ...
##  $ genus          : chr  "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
##  $ species        : chr  "albigula" "albigula" "albigula" "albigula" ...
##  $ taxa           : chr  "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type      : chr  "Control" "Control" "Control" "Control" ...

List the unique species:

unique(surveys$species)
##  [1] "albigula"        "merriami"        "flavus"          "eremicus"       
##  [5] "spectabilis"     "penicillatus"    "hispidus"        "torridus"       
##  [9] "ordii"           "sp."             "spilosoma"       "leucogaster"    
## [13] "megalotis"       "audubonii"       "maniculatus"     "harrisi"        
## [17] "bilineata"       "melanocorys"     "squamata"        "fulvescens"     
## [21] "taylori"         "montanus"        "ochrognathus"    "baileyi"        
## [25] "brunneicapillus" "chlorurus"       "fulviventer"     "intermedius"    
## [29] "leucopus"        "viridis"         "gramineus"       "savannarum"     
## [33] "leucophrys"      "scutalatus"      "undulatus"       "fuscus"         
## [37] "tereticaudus"    "tigris"          "clarki"          "uniparens"
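To count the distinct values rather than list them, unique() can be combined with length(), and table() gives the number of records per value. A sketch on a small made-up vector follows; applied to surveys$species, the first call should match the 40 names printed above.

```r
species <- c("albigula", "merriami", "albigula", "flavus", "merriami", "albigula")

# Number of distinct species
length(unique(species))    # 3

# Number of records per species, most frequent first
species_counts <- table(species)
sort(species_counts, decreasing = TRUE)
```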

Let's learn indexing

# first element in the first column of the data frame (as a vector)
surveys[1, 1]   
# first element in the 6th column (as a vector)
surveys[1, 6]   
# first column of the data frame (as a vector)
surveys[, 1]    
# first column of the data frame (as a data.frame)
surveys[1]   
# The whole data frame, except the first column
surveys[, -1]
# first five rows (all columns)
surveys[1:5, ]

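Indexing by column name:

# a single column, returned as a data.frame
surveys["species_id"]
# the same column, returned as a vector
surveys$species_id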
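The different indexing notations return different kinds of objects, which matters when passing the result to other functions. A minimal sketch on a made-up data frame called `toy`:

```r
toy <- data.frame(species_id = c("NL", "DM", "PF"),
                  weight = c(NA, 40, 8),
                  stringsAsFactors = FALSE)

class(toy["species_id"])      # "data.frame": single brackets keep the table structure
class(toy[["species_id"]])    # "character": double brackets extract the column itself
class(toy$species_id)         # "character": $ is shorthand for [[ ]] with a name

# The last two forms give exactly the same vector
identical(toy[["species_id"]], toy$species_id)    # TRUE
```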

5.5 Plotting

Check the frequency distribution of weight:

hist(surveys$weight)

plot(surveys$weight, surveys$hindfoot_length)

library(ggplot2)
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
    geom_point(alpha = 0.1, aes(color = species))
## Warning: Removed 4048 rows containing missing values (geom_point).
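The warning above indicates that rows with a missing value in either plotted variable are left out of the scatter plot. One way to count such rows beforehand is complete.cases(); the sketch below uses a made-up four-row data frame rather than the survey data.

```r
toy <- data.frame(weight = c(40, NA, 48, NA),
                  hindfoot_length = c(35, 32, NA, NA))

# Restrict to the two columns used in the plot
plot_cols <- toy[, c("weight", "hindfoot_length")]

# Rows with at least one NA cannot be drawn
n_dropped <- sum(!complete.cases(plot_cols))
n_dropped    # 3

# Rows that can be drawn
plottable <- plot_cols[complete.cases(plot_cols), ]
nrow(plottable)    # 1
```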

5.6 Add a new variable

library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
## Warning: 129 failed to parse.
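A parsing warning like the one above typically means that some year, month and day combinations do not form a real calendar date (for example 31 April). That is an assumption about this dataset, not something the output confirms. The sketch below illustrates the effect with invented values and uses base R's as.Date() as a stand-in for lubridate's ymd():

```r
# Build date strings the same way as above, from separate components
date_strings <- paste(c(2000, 2000, 2000), c(4, 4, 9), c(30, 31, 31), sep = "-")
date_strings    # "2000-4-30" "2000-4-31" "2000-9-31"

# April and September have only 30 days, so two of these are not valid dates
parsed <- as.Date(date_strings, format = "%Y-%m-%d")
is.na(parsed)   # FALSE TRUE TRUE

# Inspect the inputs that could not be parsed
date_strings[is.na(parsed)]
```

On the real data, the same idea would be to look at the year, month and day values of the rows where surveys$date is NA.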

5.7 Get a summary of the dates

summary(surveys$date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1977-07-16" "1984-03-12" "1990-07-22" "1990-12-15" "1997-07-29" "2002-12-31" 
##         NA's 
##        "129"

5.8 Export the data

# write.csv() produces a comma-separated file that matches the .csv extension
write.csv(surveys, "portal_data_joined_with_date.csv", row.names = FALSE)
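Because the output file has a .csv extension, it is worth knowing how write.table() and write.csv() differ by default: write.table() separates fields with spaces and adds row names, while write.csv() uses commas. A small sketch with a made-up data frame and temporary files:

```r
df <- data.frame(id = 1:2, name = c("a", "b"))

# Default write.table(): space-separated, with row names
space_file <- tempfile(fileext = ".txt")
write.table(df, space_file)
readLines(space_file)

# write.csv() without row names: a plain comma-separated table
comma_file <- tempfile(fileext = ".csv")
write.csv(df, comma_file, row.names = FALSE)
readLines(comma_file)
```

Reading the file back in with read.csv() is a quick way to confirm that the export has the expected columns.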
 

Created and Maintained by Sangram Keshari Sahu
Rmarkdown Template used from Rmdplates package
Licensed under CC-BY 4.0
Source Code At GitHub